Abstract

The Neighbor-Joining algorithm is a recursive procedure for reconstructing trees that is based on a transformation of pairwise distances between leaves. We present a generalization of the neighbor-joining transformation, which uses estimates of phylogenetic diversity rather than pairwise distances in the tree. This leads to an improved neighbor-joining algorithm whose total running time is still polynomial in the number of taxa. On simulated data, the method outperforms other distance-based methods.

We have implemented neighbor-joining for subtree weights in a program called MJOIN which is freely available under the GNU Public License at

http://bio.math.berkeley.edu/mjoin/

1 Introduction



Distance based methods for phylogenetic reconstruction are based on the observation that edge weighted phylogenetic X-trees (trees that have a set X as their leaves, all interior vertices of degree at least three and non-negative weights $w_T : E(T) \to \mathbb{R}_{\geq 0}$ on every edge) can be encoded by certain metrics on X.

Theorem 1 (Four-point condition [Buneman, 1971]). Given a metric $D : X \times X \to \mathbb{R}$ there exists an edge weighted phylogenetic X-tree T such that $D(i,j) = \sum_{e \in E(P_{ij})} w_T(e)$, where $P_{ij}$ is the path in T from i to j, iff

$$D(i,j) + D(k,l) \leq \max\big( D(i,k) + D(j,l),\; D(j,k) + D(i,l) \big)$$

for every four leaves i, j, k, l. Furthermore, T is unique.
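As an illustration of Theorem 1, the following minimal Python sketch tests the four-point condition on a dissimilarity map stored as a nested dictionary. The function names, the tolerance `tol`, and the toy quartet tree are assumptions made for this example only. The check uses the equivalent formulation that, among the three pairings of any four leaves, the two largest sums must coincide.

```python
from itertools import combinations

def satisfies_four_point(D, tol=1e-9):
    """Return True if the symmetric map D (dict of dicts) passes the four-point test."""
    for i, j, k, l in combinations(list(D), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        # The condition holds for every ordering iff the two largest sums are equal.
        if sums[2] - sums[1] > tol:
            return False
    return True

def quartet_metric(pendant=1.0, interior=1.0):
    """Path distances on a hypothetical toy tree ab|cd with equal pendant edges."""
    leaves = "abcd"
    same_side = {frozenset("ab"), frozenset("cd")}
    D = {x: {} for x in leaves}
    for x, y in combinations(leaves, 2):
        d = 2 * pendant + (0.0 if frozenset((x, y)) in same_side else interior)
        D[x][y] = D[y][x] = d
    return D

tree_like = quartet_metric()
noisy = quartet_metric()
noisy["a"]["c"] = noisy["c"]["a"] = 3.5  # perturb a single entry
print(satisfies_four_point(tree_like), satisfies_four_point(noisy))
```

Perturbing a single entry is already enough to leave the space of tree metrics, which is the situation faced with distances estimated from data.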

Such metrics are called tree metrics and many methods have been proposed for projecting dissimilarity maps (functions $D : X \times X \to \mathbb{R}$ with $D(x,x) = 0$ and $D(x,y) = D(y,x)$) to "nearby" tree metrics. The neighbor-joining algorithm, introduced by [Saitou and Nei, 1987], is the most popular and widely used. It is particularly convenient for reconstructing phylogenetic trees when the size of X is large, in which case methods that require an exhaustive exploration of the space of trees are computationally prohibitive.

There are four parts to the neighbor-joining algorithm (see algorithm 1): 

1. A procedure for estimating pairwise distances between elements of X.

2. A criterion for identifying neighboring pendant edges (cherries) in a tree.

3. A recursive reduction. 

4. A branch length estimation formula. 

The cherry picking criterion is based on the following theorem: 



Theorem 2 ([Saitou and Nei, 1987], [Studier and Keppler, 1988]). If D is a tree metric and

$$Q_D(i,j) = (n-2)\,D(i,j) - \sum_{k \neq i} D(i,k) - \sum_{k \neq j} D(j,k)$$

then the pair x, y that minimizes $Q_D(x,y)$ is a cherry in the tree.

Although the exact formula for Q may seem a bit mysterious at first, it is a very natural criterion. For example, the neighbor-joining algorithm which is based on it is consistent (i.e. if D is a tree metric then the algorithm returns the tree), the input order of the taxa does not change the outcome of the algorithm, and the criterion is a linear function of the distances. [Bryant, 2005] has recently shown that the neighbor-joining selection criterion Q(i,j) is the only one satisfying the properties above. Furthermore, [Gascuel, 1997b] has shown that the neighbor-joining criterion can be interpreted as greedily minimizing a balanced minimum evolution criterion, which provides added understanding as to why it has been a very successful method.



The recursive reduction step and branch length estimation formula have been examined extensively, resulting in a number of improvements to the basic neighbor-joining algorithm. For example, the reduction step has been shown to be optimal when variances on the estimates are unknown, yet improvable when variance information is incorporated [Gascuel, 1994, Gascuel, 1997a, Gascuel, 1997b].



Algorithm 1: Neighbor-joining algorithm

Data: A set X together with sequences corresponding to the elements of X
Result: Edge weighted phylogenetic X-tree T
for $\{i,j\} \in \binom{X}{2}$ do
    Compute the maximum likelihood distance $D(i,j)$ between taxa i and j;
end
while $|X| > 2$ do
    for $\{i,j\} \in \binom{X}{2}$ do
        Set $Q_D(i,j) = (|X|-2)\,D(i,j) - \sum_{k \in X \setminus \{i\}} D(i,k) - \sum_{k \in X \setminus \{j\}} D(k,j)$;
    end
    Choose a pair $x, y \in X$ that minimizes $Q_D(x,y)$;
    Add a new element $z_{|X|}$ to the set X and remove x and y;
    Let $u_{|X|} = x$ and $v_{|X|} = y$;
    Set $D(i, z_{|X|}) = \frac{1}{2}\big( D(i,x) + D(i,y) - D(x,y) \big)$;
end

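while $|X| < n - 2$ do
    Set $D(u_{|X|}, z_{|X|}) = \frac{1}{2(|X|-2)} \sum_{k \neq u_{|X|}, v_{|X|}} \big( D(u_{|X|},k) + D(u_{|X|},v_{|X|}) - D(k,v_{|X|}) \big)$;
    Set $D(v_{|X|}, z_{|X|}) = \frac{1}{2(|X|-2)} \sum_{k \neq u_{|X|}, v_{|X|}} \big( D(v_{|X|},k) + D(u_{|X|},v_{|X|}) - D(k,u_{|X|}) \big)$;
    Add $u_{|X|}$ and $v_{|X|}$ into X.
end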
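The cherry picking, reduction and branch length steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration in the common Studier-Keppler style, which assigns branch lengths at the moment two nodes are joined instead of in a second pass; it is not the MJOIN code. The function name, the tuple labels for interior nodes, and the five-taxon example distances (path lengths in a hypothetical tree with cherries (A,B) and (D,E)) are assumptions made for this sketch.

```python
from itertools import combinations

def neighbor_joining(D):
    """Run neighbor-joining on a symmetric dict-of-dicts D (at least 3 taxa).

    Returns a list of (node, parent, branch_length) edges.
    """
    D = {a: dict(row) for a, row in D.items()}  # work on a copy
    nodes = list(D)
    edges = []
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(D[a][b] for b in nodes if b != a) for a in nodes}
        # Cherry picking: minimize Q(i,j) = (n-2) D(i,j) - r_i - r_j.
        x, y = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        z = ("join", x, y)
        # Branch lengths from x and y to the new interior node z.
        lx = 0.5 * D[x][y] + (r[x] - r[y]) / (2 * (n - 2))
        edges += [(x, z, lx), (y, z, D[x][y] - lx)]
        # Recursive reduction: distances from z to the remaining nodes.
        rest = [k for k in nodes if k not in (x, y)]
        D[z] = {}
        for k in rest:
            D[z][k] = D[k][z] = 0.5 * (D[x][k] + D[y][k] - D[x][y])
        nodes = rest + [z]
    a, b, c = nodes  # the last three nodes meet at one interior vertex
    center = ("join", a, b, c)
    edges.append((a, center, 0.5 * (D[a][b] + D[a][c] - D[b][c])))
    edges.append((b, center, 0.5 * (D[a][b] + D[b][c] - D[a][c])))
    edges.append((c, center, 0.5 * (D[a][c] + D[b][c] - D[a][b])))
    return edges

# Hypothetical tree metric: pendant edges A..E of length 1..5,
# interior edges AB|CDE = 1.5 and ABC|DE = 2.5.
pairs = {("A", "B"): 3, ("A", "C"): 5.5, ("A", "D"): 9, ("A", "E"): 10,
         ("B", "C"): 6.5, ("B", "D"): 10, ("B", "E"): 11,
         ("C", "D"): 9.5, ("C", "E"): 10.5, ("D", "E"): 9}
D = {x: {} for x in "ABCDE"}
for (x, y), d in pairs.items():
    D[x][y] = D[y][x] = d
edges = neighbor_joining(D)
```

Because the input here is an exact tree metric, the returned branch lengths coincide with the edge weights of the generating tree, which is the consistency property discussed above.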



Nevertheless, the main problem with the neighbor-joining scheme is that in the first step, the distances are estimated from noisy data and the resulting dissimilarity map is therefore very unlikely to be a tree metric. For biological sequences, the pairwise distance estimates are typically based on a probabilistic model of evolution such as the [Jukes and Cantor, 1969] model: given two sequences of length L with k differences between them, the distance is estimated as

$$D_{JC} = -\frac{3}{4} \ln\left( 1 - \frac{4}{3}p \right)$$

where $p = \frac{k}{L}$. The variance is given by

$$\mathrm{Var}(D_{JC}) = \frac{p(1-p)}{L\left(1 - \frac{4}{3}p\right)^2}.$$

Notice that as $p \to \frac{3}{4}$ the variance approaches infinity, which reflects the fact that long branch lengths are difficult to resolve with finite sequences. This phenomenon exists whenever branch lengths are estimated using Markov models of evolution. Although the neighbor-joining algorithm is consistent, the fact that dissimilarity maps estimated from data are not tree metrics means that there is no guarantee that the algorithm produces the correct tree.
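To make the growth of the variance concrete, here is a small Python sketch of the Jukes-Cantor distance and its large-sample variance. The function names and the example counts are assumptions chosen for illustration.

```python
import math

def jc_distance(k, L):
    """Jukes-Cantor distance from k observed differences in L aligned sites."""
    p = k / L
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_variance(k, L):
    """Large-sample variance of the Jukes-Cantor distance estimate."""
    p = k / L
    return p * (1.0 - p) / (L * (1.0 - 4.0 * p / 3.0) ** 2)

# p = 0.1, 0.4 and 0.7 for sequences of length 500
for k in (50, 200, 350):
    print(k, round(jc_distance(k, 500), 4), round(jc_variance(k, 500), 4))
```

The variance grows by orders of magnitude as p moves toward 3/4, even though the sequence length stays fixed.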

A number of attempts have been made to understand the good results 
obtained with the neighbor-joining algorithm, especially given the problems 
with the inference procedures used for estimating pairwise distances. One of 
the main results is the following: 

Theorem 3 ([Atteson, 1999]). Neighbor-joining has $l_\infty$ radius $\frac{1}{2}$.

This means that if the distance estimates are at most half the minimal edge length of the tree away from their true value then the neighbor-joining algorithm will reconstruct the correct tree. However, as we will see in section 4, this criterion is rarely met even in cases where neighbor-joining has a high success rate.
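A check of this kind can be sketched in a few lines of Python. The helper names, the toy quartet distances and the perturbation are assumptions made for this example; the strict inequality is one common way to state the radius condition.

```python
def max_abs_error(D_est, D_true):
    """l-infinity distance between two dissimilarity maps (dicts of dicts)."""
    return max(abs(D_est[i][j] - D_true[i][j]) for i in D_true for j in D_true[i])

def within_atteson_radius(D_est, D_true, min_edge_length):
    """True if every estimate is within half the shortest edge of its true value."""
    return max_abs_error(D_est, D_true) < 0.5 * min_edge_length

# Hypothetical quartet tree ab|cd with all edge lengths equal to 1.
D_true = {"a": {"b": 2.0, "c": 3.0, "d": 3.0}, "b": {"a": 2.0, "c": 3.0, "d": 3.0},
          "c": {"a": 3.0, "b": 3.0, "d": 2.0}, "d": {"a": 3.0, "b": 3.0, "c": 2.0}}

def perturbed(delta):
    """Copy of D_true with the a-c distance shifted by delta."""
    out = {i: dict(row) for i, row in D_true.items()}
    out["a"]["c"] += delta
    out["c"]["a"] += delta
    return out

print(within_atteson_radius(perturbed(0.4), D_true, 1.0))
print(within_atteson_radius(perturbed(0.6), D_true, 1.0))
```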

Despite the unavailability of precise criteria for judging the success of 
neighbor-joining, there have been efforts aimed at improving the distance es- 
timates which form the input to the algorithm. For example, the TRIPLEML 
method [Ranwez and Gascuel, 2002] improves on the pairwise distance esti- 
mates by adjusting them using additional taxa: for each pair of leaves, a third 
leaf is selected and an approximate (numerical) maximum likelihood estimate 
for the branch lengths of the three leaf subtree is computed from which the 
pairwise distance of the original leaves is estimated. In the WEIGHBOR algorithm [Bruno et al., 2000], the neighbor-joining criterion is replaced so as to weight long branch lengths. These methods, and others similar to them,
have the drawback that either their performance remains limited by the in- 
herent uncertainty in pairwise distance estimates, or else the simple, natural, 
and mathematically justified structure of the neighbor-joining algorithm is 
abandoned. 

It was suggested in [Pachter and Speyer, 2004] that an alternative encoding of edge weighted phylogenetic X-trees may be used to improve phylogenetic reconstruction while preserving many of the properties of distance based methods. Let $X^m$ denote the m-th Cartesian product of X and $\binom{X}{m}$ all the m-element subsets of X. For a phylogenetic X-tree T with $R \subseteq X$ let [R] denote the smallest subtree of T spanning R.

Theorem 4 ([Pachter and Speyer, 2004]). Let T be a phylogenetic X-tree ($|X| = n$) and $m \geq 2$ be an integer. Let $n \geq 2m - 1$, and let $D : \binom{X}{m} \to \mathbb{R}$ be the map $R \mapsto \sum_{e \in [R]} w_T(e)$ for each $R \in \binom{X}{m}$. Then T is determined by the set of values D(R) (and this is not true if $2m - 2 = n > 2$).

Instead of reconstructing trees from dissimilarity maps (m = 2), it was suggested that maximum likelihood methods could be used to more accurately estimate the phylogenetic diversity values D(R) [Faith, 1992] for every $R \subseteq X$, $|R| = m$. The phylogenetic diversity values are also conveniently called the m-subtree weight values. Such estimates result in values which form an m-dissimilarity map, i.e. a function $D : X^m \to \mathbb{R}$ with $D(x, x, \ldots, x) = 0$ and $D(x_1, \ldots, x_m) = D(x_{i_1}, \ldots, x_{i_m})$ for any permutation $(i_1, \ldots, i_m) \in S_m$. The problem is then to develop consistent tree reconstruction algorithms that find a tree whose m-subtree weights are "close" to the m-dissimilarity map.

In this paper we propose a practical, efficient method for tree reconstruction based on m-dissimilarity maps. We begin by refining theorem 4 and show that even if $n < 2m - 1$ partial information about the tree is recoverable. We then describe a neighbor-joining algorithm whose cherry picking criterion makes use of m-subtree weights. The algorithm is a generalization of standard neighbor-joining (in the special case m = 2 the formulas in the algorithm simplify to neighbor-joining). It also satisfies many of the same properties: the method is consistent, the input order of the taxa does not change the outcome, and the cherry picking criterion is a linear function of the distances. In section 4 we argue that it is more accurate than neighbor-joining, and the fact that it is polynomial in the number of taxa means that it is practical for the same kinds of large problems for which neighbor-joining is used. In fact, the running time for m = 3 is $O(n^3)$, the same as for standard neighbor-joining (only with a higher time constant for the initial estimation of the weights).

Our main results depend on yet another encoding of phylogenetic X-trees. Given four leaves i, j, k, l in a phylogenetic X-tree, we use the notation

$$|(i,j;k,l)| := \big| E\big( [\{i,j\}] \cap [\{k,l\}] \big) \big|.$$

We say that $(i,j;k,l)$ is a tree quartet if $|(i,j;k,l)| = 0$. If q(T) denotes the set of tree quartets then there is a partial order $\leq$ on all X-trees where $T' \leq T$ iff $q(T') \subseteq q(T)$.

Theorem 5 ([Buneman, 1971], [Semple and Steel, 2003]). Let T and T' be two phylogenetic X-trees. Then $q(T) = q(T')$ iff $T = T'$.

2 Tree metrics from m-weights 

Our main results about m-subtree weights are based on a mapping that associates to any m-dissimilarity map a 2-dissimilarity map which, for m-subtree weights from a tree, preserves a certain subforest. This subforest is characterized by containing those edges whose removal results in sufficiently small components in the tree. Specifically, for a tree T, the removal of any edge results in two components, and we denote by $T_{\leq k}$ the subforest of T whose edge set consists of edges whose removal results in one of the components having size at most k. For example $T_{\leq 1}$ consists of all the pendant edges (adjacent to leaves), and $T_{\leq k} = T$ for any $k \geq \lfloor n/2 \rfloor$ because the removal of any edge in a tree leaves a component of size at most $\lfloor n/2 \rfloor$. For the tree T in figure 1 with 24 leaves, $T = T_{\leq 12}$.

Figure 1: A tree T and four subforests.
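The definition of the subforest can be illustrated with a short Python sketch. It assumes that each edge is recorded by the set of leaves on one side of it and that component size is measured in leaves; the 6-leaf tree and the function name are hypothetical choices for this example.

```python
# Hypothetical 6-leaf tree: each edge is stored as the leaf set on one side of it.
# Leaves 0 and 1 form a cherry, as do leaves 4 and 5; leaves 2 and 3 sit in between.
N_LEAVES = 6
EDGE_SPLITS = [frozenset({i}) for i in range(N_LEAVES)] + [
    frozenset({0, 1}), frozenset({0, 1, 2}), frozenset({4, 5})]

def subforest_edges(k):
    """Edges whose removal leaves a component with at most k leaves."""
    return [s for s in EDGE_SPLITS if min(len(s), N_LEAVES - len(s)) <= k]

for k in (1, 2, 3):
    print(k, len(subforest_edges(k)))
```

For this tree the subforest with k = 1 contains only the six pendant edges, k = 2 adds the two cherry edges, and k = 3 = n/2 already gives the whole tree.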

Theorem 6. Let D be an m-dissimilarity map on a set X of size n and define

$$S_D(i,j) = \sum_{Y \in \binom{X \setminus \{i,j\}}{m-2}} D(i,j,Y). \qquad (1)$$

If $D(R) = \sum_{e \in [R]} w_T(e)$ for every $R \in \binom{X}{m}$ in some edge weighted phylogenetic X-tree T, then $S_D$ is a tree metric. Furthermore, if T' is the tree corresponding to this tree metric, then $T' \leq T$ with $T'_{\leq n-m} = T_{\leq n-m}$ and there is an invertible linear map between the edge weights in $T_{\leq n-m}$ and the corresponding edge weights in $T'_{\leq n-m}$ (with the exception that in the case that $T \neq T_{\leq n-m}$, the pendant edge weights are not uniquely determined).
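The map in equation (1) is easy to evaluate numerically. The sketch below is an illustration under stated assumptions: the 6-leaf tree, its edge weights, and all function names are hypothetical, and exact subtree weights are computed from the tree instead of being estimated by maximum likelihood. It computes $S_D$ for several values of m and checks the four-point condition.

```python
from itertools import combinations

# Hypothetical 6-leaf tree; each edge is (leaf set on one side, weight).
LEAVES = list(range(6))
EDGES = [(frozenset({i}), 1.0 + i) for i in LEAVES] + [
    (frozenset({0, 1}), 0.5), (frozenset({0, 1, 2}), 0.7), (frozenset({4, 5}), 0.9)]

def subtree_weight(R):
    """D(R): total weight of the edges in the smallest subtree spanning R."""
    R = set(R)
    # An edge lies in the spanning subtree iff R has leaves on both of its sides.
    return sum(w for side, w in EDGES if R & side and R - side)

def S(i, j, m):
    """S_D(i,j) from equation (1): sum of D(i,j,Y) over all (m-2)-subsets Y."""
    others = [x for x in LEAVES if x not in (i, j)]
    return sum(subtree_weight((i, j) + Y) for Y in combinations(others, m - 2))

def four_point_holds(dist, tol=1e-9):
    """Check that the two largest pairing sums agree for every four leaves."""
    for i, j, k, l in combinations(LEAVES, 4):
        sums = sorted([dist(i, j) + dist(k, l), dist(i, k) + dist(j, l),
                       dist(i, l) + dist(j, k)])
        if sums[2] - sums[1] > tol:
            return False
    return True

for m in (2, 3, 4):
    print(m, four_point_holds(lambda i, j: S(i, j, m)))
```

With m = 4 and n = 6 the condition $n \geq 2m - 1$ fails, yet the four-point condition still holds; only the middle interior edge is lost, in line with the statement about $T_{\leq n-m}$.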

For a fixed tree T and integer m, let $S = S_D$ where D is the m-dissimilarity map induced by T. Observe that for an edge weighted phylogenetic X-tree T, any linear combination of the m-subtree weights is a linear combination of the edge weights $w_T(e)$ in the tree. For a linear function on the m-subtree weights $F : \mathbb{R}^{\binom{n}{m}} \to \mathbb{R}$ let $v_F(e)$ denote the coefficient of $w_T(e)$ in F. For instance, $v_{S(i,j)}(e)$ denotes the coefficient of $w_T(e)$ in S(i,j). Note that $v_{F+G}(e) = v_F(e) + v_G(e)$. We will also use the notation $L_i(e)$ to denote the set of leaves in the component of $T - e$ that contains leaf i, and $P_{ab}$ to denote the path from vertex a to b.

Lemma 7. Given a pair of leaves a, b and any edge e we have

$$v_{S(a,b)}(e) = \begin{cases} \binom{n-2}{m-2} & \text{if } e \in P_{ab}, \\ \binom{n-2}{m-2} - \binom{|L_a(e)|-2}{m-2} & \text{otherwise.} \end{cases}$$

Proof: If e is on the path from a to b, then it will be included in all the subtrees [a, b, Y]. If e is not on the path from a to b, then the only way it will be excluded is if all the other leaves fall on the a side of e (which is the same as the b side). That is, if $Y \subseteq L_a(e) \setminus \{a, b\}$. There are $\binom{|L_a(e)|-2}{m-2}$ such sets. □

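Lemma 8. Given a quartet $(a_1, a_2; a_3, a_4)$ in T with interior vertices $b_1$ and $b_2$ (figure 2), then

$$v_{S(a_1,a_2)+S(a_3,a_4)}(e) = \begin{cases}
2\binom{n-2}{m-2} - \binom{|L_{a_1}(e)|-2}{m-2} - \binom{n-|L_{a_1}(e)|-2}{m-2} & e \in P_{b_1b_2}, \\
2\binom{n-2}{m-2} - \binom{n-|L_{a_i}(e)|-2}{m-2} & e \in P_{a_ib_j}, \\
2\binom{n-2}{m-2} - 2\binom{|L_{a_1}(e)|-2}{m-2} & e \notin [\{a_1,a_2,a_3,a_4\}],
\end{cases}$$

$$v_{S(a_1,a_3)+S(a_2,a_4)}(e) = \begin{cases}
2\binom{n-2}{m-2} & e \in P_{b_1b_2}, \\
2\binom{n-2}{m-2} - \binom{n-|L_{a_i}(e)|-2}{m-2} & e \in P_{a_ib_j}, \\
2\binom{n-2}{m-2} - 2\binom{|L_{a_1}(e)|-2}{m-2} & e \notin [\{a_1,a_2,a_3,a_4\}],
\end{cases}$$

and

$$v_{S(a_1,a_4)+S(a_2,a_3)} = v_{S(a_1,a_3)+S(a_2,a_4)}.$$

Here $P_{a_ib_j}$ denotes the path from the leaf $a_i$ to the interior vertex $b_j$ of the quartet nearest to it.

Figure 2: A quartet $(a_1, a_2; a_3, a_4)$.

Proof: We use the fact that $v_{S(a_1,a_2)+S(a_3,a_4)} = v_{S(a_1,a_2)} + v_{S(a_3,a_4)}$ and apply the previous lemma. We also note that for $e \notin [\{a_1,a_2,a_3,a_4\}]$, $L_{a_i}(e) = L_{a_j}(e)$ for all i, j. □

Corollary 9. For a quartet $(a_1, a_2; a_3, a_4)$, we define

$$S(a_1, a_2; a_3, a_4) = S(a_1, a_2) + S(a_3, a_4) - S(a_1, a_3) - S(a_2, a_4).$$

Then,

$$v_{S(a_1a_2;a_3a_4)}(e) = \begin{cases}
-\binom{|L_{a_1}(e)|-2}{m-2} - \binom{n-|L_{a_1}(e)|-2}{m-2} & \text{if } e \in P_{b_1b_2}, \\
0 & \text{otherwise.}
\end{cases}$$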



Corollary 9 implies that S satisfies the four-point condition (theorem 1), although it may be that $v_{S(a_1a_2;a_3a_4)}(e) = 0$, which means that there are interior edges in T' which have been collapsed (with length equal to 0). Suppose, however, that $(a_1, a_2; a_3, a_4) \in q(T)$ and $[\{a_1,a_2,a_3,a_4\}]$ is in a connected component of $T_{\leq n-m}$ (in other words the subtree spanning the quartet consists of edges whose removal leaves a small component). This means that if $e \in P_{b_1b_2}$ then either $|L_{a_1}(e)| \geq m$ or $n - |L_{a_1}(e)| \geq m$ and so $S(a_1, a_2; a_3, a_4) < 0$, which means that $(a_1, a_2; a_3, a_4) \in q(T')$. Therefore $q(T') \subseteq q(T)$ and it follows from theorem 5 that $T'_{\leq n-m} = T_{\leq n-m}$.

It remains to show that there is an invertible linear map between the edge weights in the forests $T_{\leq n-m}$ and $T'_{\leq n-m}$:

Lemma 10. If e is an internal edge of $T_{\leq n-m}$ with e' the corresponding edge in T' then

$$w_{T'}(e') = \frac{1}{2}\left( \binom{|L_a(e)|-2}{m-2} + \binom{|L_c(e)|-2}{m-2} \right) w_T(e)$$

where a is a leaf in one component of $T - e$ and c a leaf in the other.

Figure 3: The quartet (a, b; c, d) has only the one edge e on its splitting path.

Proof: Since e is an internal edge, we may choose a, b, c and d such that e is the only edge on the splitting path of (a, b; c, d) (figure 3). Then

$$w_{T'}(e') = -\frac{1}{2} S(a,b;c,d) = \frac{1}{2}\left( \binom{|L_a(e)|-2}{m-2} + \binom{|L_c(e)|-2}{m-2} \right) w_T(e).$$

□



Corollary 11.

$$w_T(e) = \frac{2\,w_{T'}(e')}{\binom{|L_a(e)|-2}{m-2} + \binom{|L_c(e)|-2}{m-2}}$$

which is well defined if $e \in T_{\leq n-m}$.
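Lemma 10 and Corollary 11 can be checked numerically on a small example. The sketch below is an illustration only: the 6-leaf tree, its weights, the choice m = 3 and the helper names are assumptions. The interior edge length in T' is read off from the quartet combination of S values, and the original weight is then recovered with the formula of Corollary 11.

```python
from itertools import combinations
from math import comb

# Hypothetical 6-leaf tree; each edge is (leaf set on one side, weight).
LEAVES, n, m = list(range(6)), 6, 3
EDGES = [(frozenset({i}), 1.0 + i) for i in LEAVES] + [
    (frozenset({0, 1}), 0.5), (frozenset({0, 1, 2}), 0.7), (frozenset({4, 5}), 0.9)]

def subtree_weight(R):
    """Weight of the smallest subtree spanning R."""
    R = set(R)
    return sum(w for side, w in EDGES if R & side and R - side)

def S(i, j):
    """Equation (1) for the fixed subtree size m."""
    others = [x for x in LEAVES if x not in (i, j)]
    return sum(subtree_weight((i, j) + Y) for Y in combinations(others, m - 2))

def quartet_S(a, b, c, d):
    """S(a,b;c,d) as defined in Corollary 9."""
    return S(a, b) + S(c, d) - S(a, c) - S(b, d)

def recover_interior_weight(a, b, c, d, size_a_side):
    """Recover w_T(e) for the single edge e splitting (a,b;c,d).

    size_a_side is |L_a(e)|, the number of leaves on the a side of e.
    """
    w_prime = -0.5 * quartet_S(a, b, c, d)  # edge length in T' (Lemma 10)
    scale = comb(size_a_side - 2, m - 2) + comb(n - size_a_side - 2, m - 2)
    return 2.0 * w_prime / scale  # Corollary 11

print(recover_interior_weight(0, 2, 3, 4, 3))  # edge {0,1,2} | {3,4,5}
print(recover_interior_weight(0, 1, 2, 3, 2))  # edge {0,1} | {2,3,4,5}
```

The quartets are chosen so that exactly one edge lies on the splitting path, as required in the proof of Lemma 10.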



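Lemma 12. Denote the edges adjacent to the leaves by $e_1, \ldots, e_n$ (with corresponding edges in T' $e'_1, \ldots, e'_n$) and the set of internal (non-pendant) edges by int(E(T)). Let

$$C_i = \sum_{e \in int(E(T))} \left( \binom{n-2}{m-2} - \binom{|L_i(e)|-2}{m-2} \right) w_T(e)$$

and let A be the matrix $2\binom{n-3}{m-2} I + \binom{n-3}{m-3} J$, where I is the n × n identity matrix and J is the n × n matrix of all ones. Then

$$\begin{pmatrix} w_{T'}(e'_1) \\ \vdots \\ w_{T'}(e'_n) \end{pmatrix} = \frac{1}{2} A \begin{pmatrix} w_T(e_1) \\ \vdots \\ w_T(e_n) \end{pmatrix} + \frac{1}{2} \begin{pmatrix} C_1 \\ \vdots \\ C_n \end{pmatrix}.$$

Figure 4: The leaf edge $e_i$ is incident on two other edges. We may choose leaves a and b such that $P_{ia} \cap P_{ib} = e_i$.

Proof: The interior vertex of an edge $e_i$ adjacent to a leaf i is incident to two other edges. Choose a leaf a such that $P_{ia}$ intersects one of the edges, and b such that $P_{ib}$ intersects the other (figure 4). Then

$$w_{T'}(e'_i) = \frac{1}{2}\big( S(i,a) + S(i,b) - S(a,b) \big)$$

which after some algebra gives the above lemma. □

Corollary 13.

$$\begin{pmatrix} w_T(e_1) \\ \vdots \\ w_T(e_n) \end{pmatrix} = A^{-1} \begin{pmatrix} 2w_{T'}(e'_1) - C_1 \\ \vdots \\ 2w_{T'}(e'_n) - C_n \end{pmatrix}$$

where

$$A^{-1} = \frac{1}{2\binom{n-3}{m-2}} \left( I - \frac{m-2}{m(n-2)} J \right).$$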



In order to recover $w_T(e)$ for every edge, we start by calculating the interior edge weights, after which we can calculate the values $C_i$. The matrix A is always invertible if $m < n - 1$; however, calculating $C_i$ requires that $int(E(T)) = int(E(T'))$. If $n < 2m - 1$, then while we can determine all the interior edge weights of $T_{\leq n-m}$ from T', it is possible that some interior edges of T have been collapsed in T': in particular, the set of edges in $E(T) \setminus E(T_{\leq n-m})$. If $E(T) \setminus E(T_{\leq n-m}) \neq \emptyset$, then $T_{\leq n-m}$ is composed of at least two connected components and every connected component has strictly fewer than m leaves. As a result, every m-subtree weight will include at least one undetermined edge, and so there is no way to uniquely determine the weights of the pendant edges.
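The recovery of the pendant edge weights can be sketched as follows. This is an illustration under several assumptions: the 6-leaf tree, its weights and the choice m = 3 are hypothetical; the interior weights are taken as known (in practice they would come from Corollary 11); and the linear system is solved with the standard closed-form inverse of a matrix of the form αI + βJ instead of a general solver. The sketch as written needs m ≥ 3, since `math.comb` rejects negative arguments.

```python
from itertools import combinations
from math import comb

LEAVES, n, m = list(range(6)), 6, 3
PENDANT = {i: 1.0 + i for i in LEAVES}  # true pendant weights, to be recovered
INTERIOR = {frozenset({0, 1}): 0.5, frozenset({0, 1, 2}): 0.7, frozenset({4, 5}): 0.9}
EDGES = [(frozenset({i}), w) for i, w in PENDANT.items()] + list(INTERIOR.items())

def subtree_weight(R):
    """Weight of the smallest subtree spanning R."""
    R = set(R)
    return sum(w for side, w in EDGES if R & side and R - side)

def S(i, j):
    """Equation (1) for the fixed subtree size m."""
    others = [x for x in LEAVES if x not in (i, j)]
    return sum(subtree_weight((i, j) + Y) for Y in combinations(others, m - 2))

def separates(side, x, y):
    """True if the edge with this split lies on the path from x to y."""
    return (x in side) != (y in side)

def recover_pendant_weights():
    N = comb(n - 2, m - 2)
    rhs = []
    for i in LEAVES:
        others = [x for x in LEAVES if x != i]
        # Leaves a, b whose paths from i share only the pendant edge of i.
        a, b = next((a, b) for a, b in combinations(others, 2)
                    if not any(separates(s, i, a) and separates(s, i, b)
                               for s in INTERIOR))
        w_prime = 0.5 * (S(i, a) + S(i, b) - S(a, b))  # pendant edge length in T'
        C_i = sum((N - comb((len(s) if i in s else n - len(s)) - 2, m - 2)) * w
                  for s, w in INTERIOR.items())
        rhs.append(2.0 * w_prime - C_i)
    # Solve (alpha*I + beta*J) w = rhs via the closed-form inverse of such a matrix.
    alpha, beta = 2 * comb(n - 3, m - 2), comb(n - 3, m - 3)
    shift = beta / (alpha + n * beta) * sum(rhs)
    return [(v - shift) / alpha for v in rhs]

recovered = recover_pendant_weights()
print([round(w, 6) for w in recovered])
```

In this example the recovered values agree with the pendant weights used to generate the subtree weights, which is the invertibility claimed above for the case $n \geq 2m - 1$.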



3 Neighbor-joining with subtree weights 

Theorem 6 forms the basis of the neighbor-joining algorithm with subtree weights. First, we need a generalization of the neighbor-joining criterion:

Theorem 14 (Cherry Picking Theorem). Let T be an edge weighted phylogenetic X-tree with $|X| = n$ and let m be an integer satisfying $2 \leq m < n - 1$. Let $D : X^m \to \mathbb{R}_{\geq 0}$ be the m-dissimilarity map corresponding to the weights of the subtrees of size m in T. If $Q_D(x,y)$ is a minimal element of the matrix

$$Q_D(i,j) = \frac{n-2}{m-1} \sum_{Y \in \binom{X \setminus \{i,j\}}{m-2}} D(i,j,Y) - \sum_{Y \in \binom{X \setminus \{i\}}{m-1}} D(i,Y) - \sum_{Y \in \binom{X \setminus \{j\}}{m-1}} D(j,Y)$$

then x, y is a cherry in the tree T.

Note that when m = 2 this is exactly the neighbor-joining criterion (Q-criterion of theorem 2) as described by [Studier and Keppler, 1988].

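Proof: Let $S(i,j) = \sum_{Y \in \binom{X \setminus \{i,j\}}{m-2}} D(i,j,Y)$. By theorem 6 we know that S is a tree metric. Observe that

$$\begin{aligned}
Q_D(i,j) &= \frac{n-2}{m-1} \sum_{Y \in \binom{X \setminus \{i,j\}}{m-2}} D(i,j,Y) - \sum_{Y \in \binom{X \setminus \{i\}}{m-1}} D(i,Y) - \sum_{Y \in \binom{X \setminus \{j\}}{m-1}} D(j,Y) \\
&= \frac{1}{m-1} \Big( (n-2)\,S(i,j) - \sum_{k \neq i} \sum_{Y \in \binom{X \setminus \{i,k\}}{m-2}} D(i,k,Y) - \sum_{k \neq j} \sum_{Y \in \binom{X \setminus \{j,k\}}{m-2}} D(j,k,Y) \Big) \\
&= \frac{1}{m-1} \Big( (n-2)\,S(i,j) - \sum_{k \neq i} S(i,k) - \sum_{k \neq j} S(j,k) \Big) \\
&= \frac{1}{m-1}\, Q_S(i,j),
\end{aligned}$$

where the second equality holds because each set of size m − 1 is counted once for each of its m − 1 elements k.

In other words, $Q_D(i,j)$ is just a scalar multiple of the neighbor-joining criterion for the tree metric S. By theorem 2 (m = 2) we know that the minimal element of $Q_S(i,j)$ is a cherry in T' (the tree corresponding to the tree metric S). Since $m < n - 1$, we know that $T'_{\leq 1}$ is isomorphic to $T_{\leq 1}$ and therefore the minimal element of $Q_D(i,j)$ is a cherry. □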
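The identity used in the proof can be verified numerically. The following Python sketch assumes a hypothetical 6-leaf tree with m = 3 and exact subtree weights; the function names are illustrative. It evaluates the cherry picking criterion directly from the m-subtree weights, compares it with the neighbor-joining criterion applied to S, and locates the minimizing pair.

```python
from itertools import combinations

# Hypothetical 6-leaf tree with cherries (0,1) and (4,5).
LEAVES, n, m = list(range(6)), 6, 3
EDGES = [(frozenset({i}), 1.0 + i) for i in LEAVES] + [
    (frozenset({0, 1}), 0.5), (frozenset({0, 1, 2}), 0.7), (frozenset({4, 5}), 0.9)]

def subtree_weight(R):
    """Weight of the smallest subtree spanning R."""
    R = set(R)
    return sum(w for side, w in EDGES if R & side and R - side)

def S(i, j):
    """Equation (1) for the fixed subtree size m."""
    others = [x for x in LEAVES if x not in (i, j)]
    return sum(subtree_weight((i, j) + Y) for Y in combinations(others, m - 2))

def Q_D(i, j):
    """Cherry picking criterion computed directly from m-subtree weights."""
    def star(x):
        rest = [y for y in LEAVES if y != x]
        return sum(subtree_weight((x,) + Y) for Y in combinations(rest, m - 1))
    return (n - 2) / (m - 1) * S(i, j) - star(i) - star(j)

def Q_S(i, j):
    """Standard neighbor-joining criterion applied to the metric S."""
    def row(x):
        return sum(S(x, k) for k in LEAVES if k != x)
    return (n - 2) * S(i, j) - row(i) - row(j)

best_pair = min(combinations(LEAVES, 2), key=lambda p: Q_D(*p))
print(best_pair)
```

In this example the minimizing pair is one of the two cherries of the generating tree, and the two criteria differ only by the factor 1/(m − 1).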

It follows from theorem 6 that if $m \leq \frac{n+1}{2}$ then the neighbor-joining algorithm applied directly to S is topologically consistent, i.e. will reconstruct the correct tree topology starting with the weights of all subtrees of size m. The fact that there is an invertible linear map between the edge weights means that we can reconstruct T, thus leading to a consistent neighbor-joining algorithm with subtree weights (algorithm 2).

The running time for computing the weights of the subtrees is $O(Ln^m)$ where L is the length of the alignment and the computation of S(i, j) is $O(n^m)$ (both steps are trivially parallelizable). The subsequent neighbor-joining is $O(n^3)$ and edge weight reconstruction is $O(n^2)$. It is interesting to note that for fixed L the running time of the algorithm is $O(n^3)$ for both m = 2 and m = 3.
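To give a feel for these counts, the short Python sketch below tabulates the number of subtree weights to be estimated and the number of terms summed when forming all values S(i, j). The function names and the choice n = 50 are assumptions for illustration.

```python
from math import comb

def num_subtree_weights(n, m):
    """Number of m-element subsets of n taxa, each needing one weight estimate."""
    return comb(n, m)

def num_terms_in_S(n, m):
    """Number of terms D(i,j,Y) summed over all pairs {i,j} in equation (1)."""
    return comb(n, 2) * comb(n - 2, m - 2)

for m in (2, 3, 4):
    print(m, num_subtree_weights(50, m), num_terms_in_S(50, m))
```

Both counts grow like $n^m$ for fixed m, which is why the weight estimation dominates for larger subtree sizes.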

4 Results 

We have implemented the neighbor-joining algorithm for subtree weights in a program called MJOIN. The implementation incorporates the fastDNAml [Olsen et al., 1994] program for computing the subtree weights, and allows the user to select the sizes of the subtrees to be used.

Algorithm 2: Neighbor-joining algorithm with subtree weights

Data: A set X together with sequences corresponding to the elements of X
Result: Edge weighted phylogenetic X-tree T
for $R \in \binom{X}{m}$ do
    Estimate D(R) using a (numerical) maximum likelihood method;
end
for $\{i,j\} \in \binom{X}{2}$ do
    Set $S(i,j) = \sum_{Y \in \binom{X \setminus \{i,j\}}{m-2}} D(i,j,Y)$;
end
Apply algorithm 1 (neighbor-joining) to the "distances" S(i,j) resulting in tree T';
Set T = T';
for $e \in int(E(T))$ do
    Set $w_T(e) = \frac{2\,w_{T'}(e')}{\binom{|L_a(e)|-2}{m-2} + \binom{|L_c(e)|-2}{m-2}}$;
end
for $1 \leq i \leq n$ do
    Set $C_i = \sum_{e \in int(E(T))} \left( \binom{n-2}{m-2} - \binom{|L_i(e)|-2}{m-2} \right) w_T(e)$;
end
Set $\begin{pmatrix} w_T(e_1) \\ \vdots \\ w_T(e_n) \end{pmatrix} = \frac{1}{2\binom{n-3}{m-2}} \left( I - \frac{m-2}{m(n-2)} J \right) \begin{pmatrix} 2w_{T'}(e'_1) - C_1 \\ \vdots \\ 2w_{T'}(e'_n) - C_n \end{pmatrix}$;



Figure 5: T1 and T2 trees of Ota and Li.

We tested MJOIN with simulated data on the two parameter family of trees described by [Ota and Li, 2000]. These are trees for which neighbor-joining has difficulty in resolving the correct topology. We simulated 1000 data sets on each of the two tree shapes, T1 and T2 (figure 5), at the three edge length ratios a/b = 0.01/0.07, 0.02/0.19, and 0.03/0.42. This was repeated twice, for sequences of length 500 and 1000 BP. We also repeated the runs with the Kimura 2-parameter model and obtained similar results (not shown).

Table 1 notes the success rate of MJOIN for m = 2, 3, and 4 (denoted by $NJ^{(m)}$) for each data set and compares these results to the success rate of other tree reconstruction methods. It is clear from the table that as m increases,
the success rate of MJOIN increases. Hence, for m > 2, MJOIN consistently out-performs neighbor-joining ($NJ^{(2)}$). For the T1 tree, $NJ^{(4)}$ out-performs even fastDNAml.

Figure 6 shows the standard deviation in the m-weights. We believe it is the relative improvement in the m-weight errors that is contributing to the improved performance of MJOIN as m increases. Checking the $l_\infty$ distance of the 2-distance maps from the true tree metric, we find that even in cases where neighbor-joining has a high success rate, fewer than 1% of the distance maps satisfy Atteson's condition. This suggests that the success of neighbor-joining is due to other favorable features of the projection, and we believe that a deeper understanding of neighbor-joining is necessary in order to rigorously understand the reasons for the improvements with m-subtree weights.

0.94   0.90   0.96   0.87   0.83   0.92   0.90

Tree  Length  a/b        NJ(2)  NJ(3)  NJ(4)  BN     WB     NM     QP     FM
              0.03/0.42  0.33   0.35   0.52   0.35   0.29   0.38   0.53   0.27
T2    500     0.01/0.07  0.82   0.84   0.85   0.86   0.88   0.93   0.86   0.90
              0.02/0.19  0.69   0.72   0.74   0.81   0.89   0.95   0.85   0.90
              0.03/0.42  0.19   0.29   0.36   0.46   0.70          0.47   0.59
      1000    0.01/0.07  0.96   0.97   0.98   0.98   0.98   1      0.97   0.99
              0.02/0.19  0.89   0.92   0.93   0.99   0.99   1      0.96   0.99
              0.03/0.42  0.40   0.48   0.57   0.75   0.92   0.97   0.70   0.90

Table 1: Simulations with the Jukes-Cantor model. $NJ^{(m)}$ = MJOIN with subtree size m; BN = BioNJ; WB = Weighbor; NM = NJML; QP = the quartet puzzling algorithm; FM = fastDNAml.



Figure 6: Standard deviation as a percent of total weight. For the Jukes-Cantor method, sequence length of 500 BP, m = 2, 3, 4 and subtrees drawn from T1 and T2.

5 Discussion



Theorem 6 establishes that pairwise distance based reconstruction methods can be used to reconstruct trees from m-subtree weights. This immediately suggests a number of potential improvements to the algorithm we have described. For example, by taking into account the variances of the S(i,j), it should be possible to improve on the neighbor-joining algorithm for subtree weights with better agglomeration (as is done in BIONJ).

In tests we performed with n = 10 taxa and m = 5 (results not reported) we observed a deterioration in the accuracy of the tree reconstruction algorithm, which we attribute to inaccuracies in the subtree weights estimated with fastDNAml. In fact, tests with fastDNAml on five taxa revealed that the algorithm fails to even reconstruct the correct tree topology a significant fraction of the time. Thus, we believe that until further improvements are made in ML estimation of trees, the best subtree weight size to use will be m = 4. We are encouraged by various efforts in this direction [Contois and Levy, 2005, Hoşten et al., 2003].



We have found subtree weight reconstruction to be practical and efficient for much larger examples than described here. We have run the algorithm with m = 3 on trees of up to 50 taxa on a standard PC, and it is worth noting that for larger problems it is trivial to parallelize the m-weight estimation. Thus, we believe that our method is practical and recommended for large tree constructions that currently rely on either a pairwise distance method, or a heuristic maximum likelihood search. Since the latter can fail with regularity on trees with only five taxa, it is unlikely to be accurate for large trees.

Our investigations have opened up a number of interesting questions. For example, it would be useful to obtain an analog of the four-point condition that characterizes the space of m-dissimilarity maps arising from trees. It would also be of interest to develop a subtree-weight analog of the Neighbor-Net algorithm [Bryant and Moulton, 2004].

Finally, we point out that our results can be viewed as providing approximations to maximum-likelihood tree reconstruction by refining distance-based methods. We believe that a deeper understanding of m-dissimilarity maps should yield further results in this direction.